Week 3 Computer Lab Worksheet

Interactions and causal inference

Author

Chris Moreh

Aims

In this session we continue working with data from Österman (2021) to replicate parts of their analysis. In Week 2 we practised building very basic bivariate and multiple linear regression models. In this session, we will expand on those models and attempt to replicate (some of) the models reported in Table 3 of Österman (2021).

Setup

In Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder. If it’s missing, complete Task 2 from the Week 1 Lab.

You can create a new .R script and .qmd/.Rmd for this week’s work (e.g. “Lab_3”). It may be useful to document more of your coding process and interpretations, so you may want to work in a .qmd/.Rmd document rather than an .R script.

Import data

As a first step, let’s import the osterman dataset that underpins the Österman (2021) article (see the Data page on the course website for information about the datasets available in this course):

osterman <- sjlabelled::read_stata("https://cgmoreh.github.io/HSS8005-data/osterman.dta")

Basic interactions

Continuing from the toy models that we fitted in Week 2, recall that we observed a change in the size and statistical significance of the effect of the gender variable (female) when also accounting for years of education and age. The two models compared as follows:

m_m <- lm(trustindex3 ~ eduyrs25 + agea + female, data = osterman)

m_b <- lm(trustindex3 ~ female, data = m_m$model)

sjPlot::tab_model(m_b, m_m)
                                 trustindex3                   trustindex3
Predictors                       Estimates  CI            p    Estimates  CI           p
(Intercept)                      5.24   5.22 – 5.26   <0.001   2.95   2.87 – 3.04   <0.001
female                           0.01  -0.02 – 0.04    0.582   0.04   0.01 – 0.07    0.004
eduyrs25                                                       0.12   0.11 – 0.12   <0.001
Age of respondent, calculated                                  0.02   0.01 – 0.02   <0.001
Observations                     68211                         68211
R2 / R2 adjusted                 0.000 / -0.000                0.063 / 0.063

The question arises whether the gender variable interacts with the other predictors, particularly with education, as that is the main focus of the research. In other words, we may have reason to assume that the linear effect of education differs for males and females. We can interact predictor variables in regression models using the : and * operators. With :, we include in the regression model only the interaction term, but not the component variables (i.e. their main effects); to include the component variables as well, we need to add them in as usual with the + operator. Note that we should normally want to include both the main effects and the interaction effects in a regression model. With *, we include both the main effects and the interaction effects, so it provides a shortcut to using :, but it is often useful to combine the two in more complex interaction scenarios.
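The equivalence of the two specifications can be checked on simulated data (a toy sketch with arbitrary variable names, not the osterman dataset):

```r
# Toy check that y ~ x * g and y ~ x + g + x:g fit the same model
set.seed(1)
d <- data.frame(x = rnorm(100), g = rbinom(100, 1, 0.5))
d$y <- 1 + 0.5 * d$x + 0.3 * d$g + 0.2 * d$x * d$g + rnorm(100)

m_star  <- lm(y ~ x * g, data = d)        # shortcut: main effects + interaction
m_colon <- lm(y ~ x + g + x:g, data = d)  # spelled out term by term

all.equal(coef(m_star), coef(m_colon))    # TRUE: identical coefficients
```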

In the code below, we interact gender with both education and age, and the two commands have the same result:

mxa <- lm(trustindex3 ~ eduyrs25 + agea + female + female*eduyrs25 + female*agea, data = osterman)

mxb <- lm(trustindex3 ~ female*(eduyrs25 + agea), data = osterman)

## Check that they are equivalent

sjPlot::tab_model(mxa, mxb)
                                 trustindex3                    trustindex3
Predictors                       Estimates  CI             p    Estimates  CI             p
(Intercept)                      3.25   3.13 – 3.36    <0.001   3.25   3.13 – 3.36    <0.001
eduyrs25                         0.10   0.10 – 0.11    <0.001   0.10   0.10 – 0.11    <0.001
Age of respondent, calculated    0.01   0.01 – 0.02    <0.001   0.01   0.01 – 0.02    <0.001
female                          -0.53  -0.69 – -0.36   <0.001  -0.53  -0.69 – -0.36   <0.001
eduyrs25:female                  0.03   0.02 – 0.03    <0.001
agea:female                      0.00   0.00 – 0.01    <0.001
female:eduyrs25                                                 0.03   0.02 – 0.03    <0.001
female:agea                                                     0.00   0.00 – 0.01    <0.001
Observations                     68211                          68211
R2 / R2 adjusted                 0.064 / 0.064                  0.064 / 0.064

We find that both interaction terms (eduyrs25:female and agea:female) are statistically significant, which tells us that there is an interaction that may be worth including in the model. The greatest challenge lies in interpreting results from interaction models, as the usual interpretation of the main-effect terms no longer stands and the focus should be on the interaction terms instead. The best approach is to visualise the interactions in a graph, and there are various user-written functions that make this task easy. To keep it simple, we will use the plot_model() function from the {sjPlot} package.

First, we can check the interaction between education and gender:

sjPlot::plot_model(mxa, type = "pred", terms = c("eduyrs25", "female"))

This visualisation is now easier to interpret than the numerical output. We see a steeper line for females than for males, and the lines cross at around 12 years of completed schooling (roughly the length of compulsory education in most of the European countries from which the data are drawn). Years spent in education therefore have a stronger influence on social trust for females than for males beyond that point: among those with less education, males have higher levels of social trust, but longer educational careers have a stronger impact for women.

We can check the same for the interaction of gender and age:

sjPlot::plot_model(mxa, type = "pred", terms = c("agea", "female"))

The general pattern is the same; however, the overlapping confidence intervals up until about the age of 53 tell us that the observed difference between the two genders may be due to random variation in our sample. The effect of this interaction is weaker and less reliable than the one with education, and unless we have a strong, theoretically motivated reason to include it in our model, we may want to keep the simpler model without this interaction effect.

Replicating Model 1

With this basic understanding of interaction terms, we can now attempt to replicate Model 1 from the Österman article. Based on the description provided in the article and the appended materials, we can reconstruct the main components of the model as shown below. As a first step, run the command and compare our results to those presented in Model 1. Bear in mind that the effects of most coefficients are not presented in the main table (see also the table’s footnotes and Appendix 4 in the Supplementary Materials). We also did not include survey weights or any error corrections in our model, which the author does, so the results will not be exactly the same.

In the footnotes to Table 3, the author tells us that in addition to what is reported in the table,

All models include reform FEs [Fixed Effects], a general quadratic birth year trend, and reform-specific trends for birth year (linear) and age (quadratic). Additional controls: foreign-born parents and ESS round dummies.

The version of the model summary presented in Table A.3 in the Appendix also includes the additional controls, but not the other terms.

In the main text, the author further explains some of the reasoning behind their modelling choices:

One dilemma for the design is that there has been a trend of increasing educational attainment throughout the studied time period, which means that the reform-windows of treated and non-treated cohorts will also pick up the effects of this trend. To counter this, [the list of covariates] includes a general quadratic birth year trend, reform-specific linear controls for birth year and reform-specific quadratic age trends. The quadratic terms are included to allow sufficient flexibility in absorbing possible non-linear trends of increasing education within the reform-window of seven treated and seven untreated birth year cohorts. … The reform-fixed effects are also essential because they imply that only the within-reform-window variation is used to estimate the effects and between-reform differences are factored out, such as pre-reform differences in social trust. (Österman 2021, 221–22)

Before we fit the model, some of the concepts in the quotation need unpacking. A quadratic term is a second-degree polynomial term: put simply, it’s the square of the variable concerned. The quadratic of a variable such as \(age\) is therefore \(age^2\), or \(age \times age\). In other words, it is like a variable’s “interaction” with itself. Because of this, there are several ways to specify the quadratic terms to be included in a model:

  1. We could create the quadratic terms as new variables, and include those in the model, as below:
# First, load the original dataset
osterman <- sjlabelled::read_stata("https://cgmoreh.github.io/HSS8005-data/osterman.dta")

# Create new quadratic variables by multiplication with themselves and add them to the dataset, saving it as a new data object
osterman2 <- osterman |> 
  mutate(agea_quad = agea*agea,        # quadratic age variable
         yrbrn_quad = yrbrn*yrbrn)     # quadratic birth-year variable

# We now have two additional variables in the dataset; we can fit the model using those:
m1_prequad <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d +                         # main covariates reported in Table 3
                   fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) +     # additional controls for foreign-born parents and ESS Round dummies
                   agea + yrbrn + agea_quad + yrbrn_quad +                              # general quadratic birth year trend and quadratic age
                   factor(reform_id_num) +                                              # reform fixed effects dummies
                   yrbrn:factor(reform_id_num) +                                        # reform-specific birth year trend
                   agea:factor(reform_id_num) +  agea_quad:factor(reform_id_num),       # reform-specific quadratic age trend
               data = osterman2)                                                        # the new expanded dataset
  2. We can get the same results by creating the quadratic terms directly within the model-fitting function. The one thing to keep in mind is that if we want to include standard mathematical operations within a formula, we need to insulate the operation from R’s formula parsing using the I() function, which returns the contents of I(...) “as is”. The model formula would then be:
m1_funquad <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d +                         # main covariates reported in Table 3
                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) +       # additional controls for foreign-born parents and ESS Round dummies
                 agea + yrbrn + I(agea^2) + I(yrbrn^2) +                                # general quadratic birth year trend and quadratic age
                 factor(reform_id_num) +                                                # reform fixed effects dummies
                 yrbrn:factor(reform_id_num) +                                          # reform-specific birth year trend
                 agea:factor(reform_id_num) +  I(agea^2):factor(reform_id_num),         # reform-specific quadratic age trend
              data = osterman)                                                          # the original dataset
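A quick simulated example (toy variable names, not the osterman data) shows why I() is needed: inside a formula, ^ is interpreted as factor crossing, not exponentiation:

```r
# In formula syntax, x^2 means "x crossed with itself", which collapses to x,
# so the quadratic term silently disappears unless it is wrapped in I()
set.seed(7)
x <- rnorm(100)
y <- 2 + x + 3 * x^2 + rnorm(100)

m_bare      <- lm(y ~ x + x^2)     # quadratic term dropped by the formula parser
m_insulated <- lm(y ~ x + I(x^2))  # quadratic term kept "as is"

length(coef(m_bare))       # 2: intercept and x only
length(coef(m_insulated))  # 3: intercept, x, and I(x^2)
```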
  3. In the two previous options, the quadratic terms will be correlated with the original variables. To avoid this, we can rely on so-called orthogonal polynomials by using the poly() function. We can also fit the same correlated polynomial model as the ones above by adding the raw = TRUE option to poly(). In the code below, we fit the correlated version first, then the orthogonal version (this Stack Overflow discussion explains the difference between the two options in more detail):
m1_polyraw <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d + 
                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + 
                 poly(agea, 2, raw = TRUE) + poly(yrbrn, 2, raw = TRUE) +
                 factor(reform_id_num) +           
                 yrbrn:factor(reform_id_num) + 
                 agea:factor(reform_id_num) + poly(agea, 2, raw = TRUE):factor(reform_id_num),
               data = osterman)

m1_polyorth <- lm(trustindex3 ~ reform1_7 + female + blgetmg_d + 
                 fbrneur + mbrneur + fnotbrneur + mnotbrneur + factor(essround) + 
                 poly(agea, 2) + poly(yrbrn, 2) +
                 factor(reform_id_num) +           
                 yrbrn:factor(reform_id_num) + 
                 agea:factor(reform_id_num) + poly(agea, 2):factor(reform_id_num),
               data = osterman)
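The two codings produce different coefficient estimates but the same fitted model, which we can verify on simulated data (a toy sketch, not the osterman dataset):

```r
# Raw and orthogonal polynomial codings span the same column space,
# so the fitted values (and R-squared) are identical; only the coefficients differ
set.seed(42)
x <- rnorm(200)
y <- 1 + x + 0.5 * x^2 + rnorm(200)

m_raw  <- lm(y ~ poly(x, 2, raw = TRUE))
m_orth <- lm(y ~ poly(x, 2))

all.equal(fitted(m_raw), fitted(m_orth))  # TRUE: identical predictions
```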

It’s worth noting, however, that the Stata routine used by the author fitted correlated/raw-coded polynomials, so the orthogonal version above is just for comparison and we will not use it going forward. For a side-by-side comparison of the models we have fitted so far, we could use a model summary tabulation function (e.g. sjPlot::tab_model(), which we have used before, or modelsummary::modelsummary(), stargazer::stargazer(), or jtools::export_summs(); you can read through the function documentation and test them out). Below we will use the modelsummary() function from {modelsummary}. The function takes a list() object as its argument, and the list can include several models to be tabulated side by side. We can either build the list() object first and pass it to modelsummary(), or build the list directly as part of the call to modelsummary(). We can also easily include more descriptive labels for the models within the list() function, as we do below:

# First, install the package:
install.packages("modelsummary")

# Second, we make a list of the models we want to summarise; we can even name them:
models <- list(
  "Pre-calculated quadratic" = m1_prequad,
  "Within-function quadratic" = m1_funquad,
  "poly() with raw coding" = m1_polyraw,
  "poly() with default orthogonal coding" = m1_polyorth
)

# modelsummary table with stars for p-values added, as in the published article
modelsummary::modelsummary(models, stars = TRUE)

The results from modelsummary() are not shown here because the table is long and unwieldy, but it is useful to peruse for comparing the results across the different models. For a cleaner table showing only the results included in Table A.3 in the Appendix to Österman (2021), we can use the coef_map or the coef_omit option in modelsummary() and include only m1_funquad:

# It's cleaner to first make a vector of the coefficients we wish to include; we can name the coefficients as they appear in Table A.3; note that we also leave out the Intercept, as in the published table:
coefs <- c("reform1_7" = "Reform",
           "female" = "Female",
           "blgetmg_d" = "Ethnic minority",
           "fbrneur" = "Foreign-born father, Europe",
           "mbrneur" = "Foreign-born mother, Europe",
           "fnotbrneur" = "Foreign-born father, outside Europe",
           "mnotbrneur" = "Foreign-born mother, outside Europe",
           "factor(essround)2" = "ESS Round 2",
           "factor(essround)3" = "ESS Round 3",
           "factor(essround)4" = "ESS Round 4",
           "factor(essround)5" = "ESS Round 5",
           "factor(essround)6" = "ESS Round 6",
           "factor(essround)7" = "ESS Round 7",
           "factor(essround)8" = "ESS Round 8",
           "factor(essround)9" = "ESS Round 9")

# Then we pass the vector to coef_map to select the coefficients to print
modelsummary::modelsummary(list("M1" = m1_funquad), stars = TRUE, coef_map = coefs)
                                         M1
Reform                                0.063*     (0.027)
Female                                0.058***   (0.013)
Ethnic minority                      −0.241***   (0.054)
Foreign-born father, Europe          −0.111**    (0.042)
Foreign-born mother, Europe          −0.108*     (0.044)
Foreign-born father, outside Europe  −0.065      (0.073)
Foreign-born mother, outside Europe  −0.110      (0.078)
ESS Round 2                           0.059      (0.045)
ESS Round 3                           0.162*     (0.075)
ESS Round 4                           0.243*     (0.108)
ESS Round 5                           0.360*     (0.144)
ESS Round 6                           0.397*     (0.179)
ESS Round 7                           0.449*     (0.212)
ESS Round 8                           0.655**    (0.246)
ESS Round 9                           0.816**    (0.283)
Num.Obs.                              68796
R2                                    0.200
R2 Adj.                               0.198
AIC                                   268913.8
BIC                                   270056.1
Log.Lik.                             −134331.877
RMSE                                  1.71
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

We can compare our table to the published one to confirm that the results are very close, but that there are some inconsistencies that we will address later in Worksheet 6.

Exercise 1

This model provides a lot to play around with. For this exercise, take this complex model as your starting point and simplify it in various ways. Try excluding some of the interaction terms and check how the results change. Plot the interaction terms and attempt an interpretation. Note down your thoughts and models. Playing around with this model is a good way to learn about the effect of interactions.

Exercise 2

Read through the Österman article and check your understanding of how the author interprets these results. In particular, how does the interaction model help us elucidate causal factors in the regression model?

References

David, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.
El-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.
Lord, R. D. 1958. “Studies in the History of Probability and Statistics: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.
Mulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.
Österman, Marcus. 2021. “Can We Trust Education for Fostering Trust? Quasi-experimental Evidence on the Effect of Education and Tracking on Social Trust.” Social Indicators Research 154 (1): 211–33. https://doi.org/10.1007/s11205-020-02529-y.
Senn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489.